knitr::opts_chunk$set(echo = TRUE)
library(vegan)
library(ape)
library(tidyverse)
library(vroom)
library(recluster)
library(phytools)
library(stats)
library(cluster)
library(viridis)
This is a follow-along document reporting my engagement with Coding Club's Introduction to ordination tutorial. Not everything in the tutorial is replicated here.
In this tutorial, we will learn to use ordination to explore patterns in multivariate ecological datasets. We will mainly use the vegan package to introduce you to three (unconstrained) ordination techniques: Principal Component Analysis (PCA), Principal Coordinate Analysis (PCoA) and Non-metric Multidimensional Scaling (NMDS).
Ordination is a collective term for multivariate techniques which summarize a multidimensional dataset in such a way that when it is projected onto a low dimensional space, any intrinsic pattern the data may possess becomes apparent upon visual inspection (Pielou, 1984).
In ecological terms: Ordination summarizes community data (such as species abundance data: samples by species) by producing a low-dimensional ordination space in which similar species and samples are plotted close together, and dissimilar species and samples are placed far apart. Ideally and typically, dimensions of this low dimensional space will represent important and interpretable environmental gradients.
Generally, ordination techniques are used in ecology to describe relationships between species composition patterns and the underlying environmental gradients (e.g. what environmental variables structure the community?). Two very important advantages of ordination are that 1) we can determine the relative importance of different gradients and 2) the graphical results from most techniques often lead to ready and intuitive interpretations of species-environment relationships.
To give you an idea about what to expect from this ordination course today, we’ll run the following code.
# Load the community dataset which we'll use in the examples today
data(varespec)
# Open the dataset and look if you can find any patterns
# It is probably very difficult to see any patterns by just looking at the data frame!
# With this command, you'll perform an NMDS and plot the results.
varespec %>%
metaMDS(trace = F) %>%
ordiplot(type = 'none') %>%
text('sites')
Ordination and classification (or clustering) are the two main classes of multivariate methods that community ecologists employ. To some degree, these two approaches are complementary. Classification, or putting samples into (perhaps hierarchical) classes, is often useful when one wishes to assign names to, or to map, ecological communities. However, given the continuous nature of communities, ordination can be considered a more natural approach. Ordination aims at arranging samples or species continuously along gradients.
If you want to know how to do a classification, please check out our Intro to data clustering.
In this section you will learn more about how and when to use the three main (unconstrained) ordination techniques:
PCA uses a rotation of the original axes to derive new axes, which maximize the variance in the data set. In 2D, this looks as follows:
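This axis rotation can be sketched with base R's prcomp() on a small made-up two-variable dataset (illustrative only; prcomp() is one of the alternatives to the rda() function used below):

```r
# Toy 2-D data with correlated variables
set.seed(1)
a <- rnorm(100)
toy <- data.frame(a = a, b = 0.8 * a + rnorm(100, sd = 0.3))

# PCA rotates the original axes so that PC1 captures the most variance
p <- prcomp(toy)

# The rotation matrix is orthonormal (a pure rotation/reflection of the axes)
max(abs(crossprod(p$rotation) - diag(2)))  # effectively 0
# The new axes are ordered by the variance they explain
p$sdev[1] >= p$sdev[2]  # TRUE
```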
Computationally, PCA is an eigenanalysis. The most important consequences of this are that there is a single, unique solution; that the axes (principal components) are orthogonal to each other; and that each axis is associated with an eigenvalue measuring the variance it captures, so the axes are ordered by importance.
In most applications of PCA, variables are often measured in different units. For example, PCA of environmental data may include pH, soil moisture content, soil nitrogen, temperature and so on. For such data, the data must be standardized to zero mean and unit variance. For ordination of ecological communities, however, all species are measured in the same units, and the data do not need to be standardized.
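Standardizing to zero mean and unit variance is exactly what base R's scale() does; a quick sketch with made-up environmental values:

```r
# Hypothetical environmental variables measured in different units
env_toy <- data.frame(pH   = c(4.2, 5.1, 6.3, 5.8),
                      temp = c(12, 15, 19, 17))

# scale() centres each column to mean 0 and rescales it to standard deviation 1
env_std <- scale(env_toy)
colMeans(env_std)       # effectively 0 for every column
apply(env_std, 2, sd)   # 1 for every column
```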
Let's have a look at how to do a PCA in R. Several packages can perform a PCA: the rda() function in the package vegan, the prcomp() function in the package stats, and the pca() function in the package labdsv. We will use the rda() function and apply it to our varespec dataset.
PCA <- rda(varespec, scale = FALSE)
# Use scale = TRUE if your variables are on different scales (e.g. for abiotic variables).
# Here, all species are measured on the same scale
# So use scale = FALSE
# Now plot a bar plot of relative eigenvalues. This is the percentage variance explained by each axis
barplot(as.vector(PCA$CA$eig)/sum(PCA$CA$eig))
# How much of the variance in our dataset is explained by the first principal component?
# Calculate the percent of variance explained by first two axes
sum((as.vector(PCA$CA$eig)/sum(PCA$CA$eig))[1:2])
## [1] 0.7927453
# Also try to do it for the first three axes
sum(as.vector(PCA$CA$eig/sum(PCA$CA$eig))[1:3])
## [1] 0.8651851
# Now, we'll plot our results with the plot function
plot(PCA)
plot(PCA, display = 'sites', type = 'points')
plot(PCA, display = 'species', type = 'text')
# Try to display both species and sites with points.
plot(PCA, display = c('sites', 'species'), type = 'points')
# You can extract the species and site scores on the new PC for further analyses:
sitePCA <- PCA$CA$u # Site scores
speciesPCA <- PCA$CA$v # Species scores
# In a biplot of PCA, species' scores are drawn as arrows that point in the direction of increasing values for that variable.
biplot(PCA, choices = c(1,2), type = c('text', 'text'), xlim = c(-5, 10))
## Warning in arrows(0, 0, g$species[, 1] * arrlen, g$species[, 2] * arrlen, :
## zero-length arrow is of indeterminate angle and so skipped
biplot(PCA, choices = c(1,3), type = c('text', 'points'), xlim = c(-5, 10))
## Warning in arrows(0, 0, g$species[, 1] * arrlen, g$species[, 2] * arrlen, :
## zero-length arrow is of indeterminate angle and so skipped
In contrast to some of the other ordination techniques, species are represented by arrows. This implies that the abundance of the species is continuously increasing in the direction of the arrow, and decreasing in the opposite direction. Thus PCA is a linear method. PCA is extremely useful when we expect species to be linearly (or even monotonically) related to each other. Unfortunately, we rarely encounter such a situation in nature. It is much more likely that species have a unimodal species response curve:
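To illustrate what a unimodal response looks like, here is a sketch of a Gaussian response curve along an environmental gradient (the optimum and tolerance values are invented for the plot):

```r
# Species abundance peaks at an optimum and declines on both sides
gradient  <- seq(0, 10, by = 0.1)
optimum   <- 4    # hypothetical environmental optimum
tolerance <- 1.5  # hypothetical niche breadth
abundance <- exp(-(gradient - optimum)^2 / (2 * tolerance^2))

# Maximum abundance occurs at the optimum, not at either end of the gradient
gradient[which.max(abundance)]
plot(gradient, abundance, type = 'l',
     xlab = 'Environmental gradient', ylab = 'Abundance')
```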
Principal coordinates analysis (PCoA, also known as metric multidimensional scaling) attempts to represent the distances between samples in a low-dimensional, Euclidean space. In particular, it maximizes the linear correlation between the distances in the distance matrix, and the distances in a space of low dimension (typically, 2 or 3 axes are selected). The PCoA algorithm is analogous to rotating the multidimensional object such that the distances (lines) in the shadow are maximally correlated with the distances (connections) in the object:
The first step of a PCoA is the construction of a (dis)similarity matrix. While PCA is based on Euclidean distances, PCoA can handle (dis)similarity matrices calculated from quantitative, semi-quantitative, qualitative, and mixed variables. As always, the choice of (dis)similarity measure is critical and must be suitable to the data in question. If you want to know more about distance measures, please check out our Intro to data clustering. For abundance data, Bray-Curtis distance is often recommended. You can use Jaccard index for presence/absence data. When the distance metric is Euclidean, PCoA is equivalent to Principal Components Analysis. Although PCoA is based on a (dis)similarity matrix, the solution can be found by eigenanalysis. The interpretation of the results is the same as with PCA.
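To make the Bray-Curtis formula concrete, here is a base-R sketch that computes it by hand for two samples, and then runs a PCoA with stats::cmdscale() (base R's classical/metric MDS); the abundance values are made up:

```r
# Two samples with abundances of three species
s1 <- c(10, 2, 0)
s2 <- c(4, 3, 6)

# Bray-Curtis: sum of absolute differences divided by the total abundance
bray <- sum(abs(s1 - s2)) / sum(s1 + s2)
bray  # (6 + 1 + 6) / 25 = 0.52

# PCoA in base R: cmdscale() on a distance matrix
X <- matrix(c(s1, s2, c(8, 1, 2)), nrow = 3, byrow = TRUE)
scores <- cmdscale(dist(X), k = 2)   # Euclidean distances here
dim(scores)  # 3 sites x 2 ordination axes
```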
# First step is to calculate a distance matrix.
# Here we use Bray-Curtis distance metric.
dist <- vegdist(varespec, method = 'bray')
# PCoA is not included in vegan.
# We will use the ape package instead
PCOA <- pcoa(dist)
# Plot the eigenvalues and interpret
barplot(PCOA$values$Relative_eig[1:10])
# Can you also calculate the cumulative explained variance of the first 3 axes?
sum(PCOA$values$Relative_eig[1:3])
## [1] 0.7331077
# Some distance measures may result in negative eigenvalues. In that case, add a correction:
PCOA <- pcoa(dist, correction = 'cailliez')
# Plot your results
biplot.pcoa(PCOA)
# You see what's missing? Indeed, there are no species plotted on this biplot. That's because we used a dissimilarity matrix (sites x sites) as input for the pcoa() function. Hence, no species scores could be calculated. However, we can work around this problem like this:
biplot.pcoa(PCOA, varespec)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
# Extract the plot scores from first two PCoA axes (if you need them):
PCOAaxes <- PCOA$vectors[,c(1,2)]
# Compare this result with the PCA plot
par(mfrow = c(1, 2))
biplot.pcoa(PCOA, varespec)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
plot(PCA)
# reset plot window
par(mfrow = c(1, 1))
PCoA suffers from a number of flaws, in particular the arch effect (see PCA for more information). These flaws stem, in part, from the fact that PCoA maximizes a linear correlation. Non-metric Multidimensional Scaling (NMDS) rectifies this by maximizing the rank order correlation.
NMDS attempts to represent the pairwise dissimilarity between objects in a low-dimensional space. Any dissimilarity coefficient or distance measure may be used to build the distance matrix used as input. NMDS is a rank-based approach, meaning that the original distance data are substituted with ranks. Thus, rather than object A being 2.1 units distant from object B and 4.4 units distant from object C, object C is the “first” most distant from object A, while object B is the “second” most distant. While information about the magnitude of distances is lost, rank-based methods are generally more robust to data which do not have an identifiable distribution.
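The rank substitution from the example above can be shown directly in base R:

```r
# Raw distances from object A to objects B and C (values from the text)
d <- c(B = 2.1, C = 4.4)

# NMDS keeps only the rank order: B is closest (rank 1), C most distant (rank 2)
rank(d)

# Any monotone transformation of the distances leaves the ranks unchanged
identical(rank(d), rank(sqrt(d)))  # TRUE
```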
NMDS is an iterative algorithm. NMDS routines often begin by random placement of data objects in ordination space. The algorithm then begins to refine this placement by an iterative process, attempting to find an ordination in which ordinated object distances closely match the order of object dissimilarities in the original distance matrix. The stress value reflects how well the ordination summarizes the observed distances among the samples.
NMDS is not an eigenanalysis. This has three important consequences:
+ There is no unique ordination result.
+ The axes of the ordination are not ordered according to the variance they explain.
+ The number of dimensions of the low-dimensional space must be specified before running the analysis.
There is no unique solution. The end solution depends on the random placement of the objects in the first step. It is therefore necessary to run the NMDS algorithm multiple times to ensure that the ordination is stable, as any one run may get “trapped” in local optima which are not representative of the true distances. Note: this is done automatically by metaMDS() in vegan.
Axes are not ordered in NMDS. metaMDS() in vegan automatically rotates the final result of the NMDS using PCA to make axis 1 correspond to the greatest variance among the NMDS sample points. This doesn’t change the interpretation, cannot be modified, and is a good idea, but you should be aware of it.
A plot of stress (a measure of goodness-of-fit) vs. dimensionality can be used to assess the proper choice of dimensions. The stress values themselves can be used as an indicator. Stress values >0.2 are generally poor and potentially uninterpretable, whereas values <0.1 are good and <0.05 are excellent, leaving little danger of misinterpretation. Stress values between 0.1 and 0.2 are useable but some of the distances will be misleading. Finding the inflexion point can instruct the selection of a minimum number of dimensions.
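For convenience, the stress rules of thumb above can be wrapped in a small helper function (hypothetical; not part of vegan):

```r
# Hypothetical helper applying the stress thresholds quoted in the text
stress_quality <- function(stress) {
  if (stress < 0.05)      "excellent"
  else if (stress < 0.1)  "good"
  else if (stress <= 0.2) "usable, but some distances may be misleading"
  else                    "poor, potentially uninterpretable"
}

stress_quality(0.04)  # "excellent"
stress_quality(0.12)  # "usable, but some distances may be misleading"
stress_quality(0.25)  # "poor, potentially uninterpretable"
```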
Methodology of NMDS:
# First step is to calculate a distance matrix. See PCOA for more information about the distance measures
# Here we use Bray-Curtis distance, which is recommended for abundance data
dist <- vegdist(varespec, method = 'bray')
# In this part, we define a function NMDS.scree() that automatically
# performs a NMDS for 1-10 dimensions and plots the nr of dimensions vs the stress
NMDS.scree <- function(x) {
plot(
rep(1,10),
replicate(10, metaMDS(x, autotransform = F, k = 1, trace = F)$stress),
xlim = c(1, 10),
ylim = c(0, 0.10),
xlab = '# of Dimensions',
ylab = 'Stress',
main = 'NMDS stress plot'
)
for (i in 2:10) {
points(
rep(i, 10),
replicate(10, metaMDS(x, autotransform = F, k = i, trace = F)$stress)
)
}
}
# Use the function that we just defined to choose the optimal nr of dimensions
NMDS.scree(dist)
We further see on this graph that the stress decreases with the number of dimensions. This is normal behavior for a stress plot. This graph doesn't have a very clear inflexion point, so here you would select the number of dimensions for which the stress meets the criteria; that would be 3-4 dimensions. To keep this tutorial easier, let's select two dimensions, which is also an acceptable solution. Now, we will perform the final analysis with 2 dimensions.
# Because the final result depends on the initial random placement of points we'll set a seed to make the results reproducible.
set.seed(2)
# Here, we perform the final analysis and check the result
NMDS1 <- metaMDS(dist, k = 2, trymax = 100, trace = F)
# Let's check the results
NMDS1
##
## Call:
## metaMDS(comm = dist, k = 2, trymax = 100, trace = F)
##
## global Multidimensional Scaling using monoMDS
##
## Data: dist
## Distance: bray
##
## Dimensions: 2
## Stress: 0.1000211
## Stress type 1, weak ties
## Best solution was repeated 7 times in 20 tries
## The best solution was from try 12 (random start)
## Scaling: centring, PC rotation, halfchange scaling
## Species: scores missing
# If you don't provide a dissimilarity matrix, metaMDS automatically applies Bray-Curtis. So in our case, the results have to be the same
NMDS2 <- metaMDS(varespec, k = 2, trymax = 100, trace = F)
NMDS2
##
## Call:
## metaMDS(comm = varespec, k = 2, trymax = 100, trace = F)
##
## global Multidimensional Scaling using monoMDS
##
## Data: wisconsin(sqrt(varespec))
## Distance: bray
##
## Dimensions: 2
## Stress: 0.1843196
## Stress type 1, weak ties
## Best solution was repeated 2 times in 20 tries
## The best solution was from try 4 (random start)
## Scaling: centring, PC rotation, halfchange scaling
## Species: expanded scores based on 'wisconsin(sqrt(varespec))'
The results are not the same! Can you see why? metaMDS() did calculate the Bray-Curtis distances, but it first applied a square-root transformation followed by a Wisconsin double standardization to the community matrix (note the Data: wisconsin(sqrt(varespec)) line in the output above). Check the help file for metaMDS() and try to adapt the call for NMDS2 so that the automatic transformation is turned off.
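What the automatic transformation does can be sketched in base R (a simplified reimplementation of vegan's wisconsin(sqrt(x)) for illustration: square root first, then each species divided by its maximum, then each site by its row total):

```r
# Toy community matrix: 3 sites x 2 species (values invented)
comm <- matrix(c(1, 4, 9, 16, 25, 36), nrow = 3,
               dimnames = list(paste0('site', 1:3), c('spA', 'spB')))

x <- sqrt(comm)                            # square-root transformation
x <- sweep(x, 2, apply(x, 2, max), '/')    # species standardized by their maxima
x <- sweep(x, 1, rowSums(x), '/')          # sites standardized by their totals
rowSums(x)  # every site now sums to 1
```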
NMDS2 <- metaMDS(varespec, k = 2, trymax = 100, trace = F, autotransform = F)
NMDS2
##
## Call:
## metaMDS(comm = varespec, k = 2, trymax = 100, autotransform = F, trace = F)
##
## global Multidimensional Scaling using monoMDS
##
## Data: varespec
## Distance: bray
##
## Dimensions: 2
## Stress: 0.1000211
## Stress type 1, weak ties
## Best solution was repeated 11 times in 20 tries
## The best solution was from try 4 (random start)
## Scaling: centring, PC rotation, halfchange scaling
## Species: expanded scores based on 'varespec'
Let’s check the results of NMDS1 with a stressplot
stressplot(NMDS1)
There is a good non-metric fit between observed dissimilarities (in our distance matrix) and the distances in ordination space. Also the stress of our final result was ok (do you know how much the stress is?). So we can go further and plot the results:
plot(NMDS1, type = 't')
## species scores not available
There are no species scores (same problem as we encountered with PCoA). We can work around this problem, by giving metaMDS the original community matrix as input and specifying the distance measure.
NMDS3 <- metaMDS(varespec, k = 2, trymax = 100, trace = F, autotransform = FALSE, distance = 'bray')
plot(NMDS3)
plot(NMDS3, display = 'sites', type = 'n')
points(NMDS3, display = 'sites', col = 'red', cex = 1.25)
text(NMDS3, display = 'species')
# Alternatively, you can use the functions ordiplot and orditorp
ordiplot(NMDS3, type = 'n')
orditorp(NMDS3, display = 'species', col = 'red', air = 0.01)
orditorp(NMDS3, display = 'sites', cex = 1.1, air = 0.01)
We now have a nice ordination plot and we know which plots have a similar species composition. We also know that the first ordination axis corresponds to the largest gradient in our dataset (the gradient that explains the most variance in our data), the second axis to the second-largest gradient, and so on. The next question is: which environmental variable is driving the observed differences in species composition? We can answer this by correlating environmental variables with our ordination axes. For this, we will use a second dataset with environmental variables (samples by environmental variables). We continue using the results of the NMDS.
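Conceptually, envfit() regresses each environmental variable on the ordination axes: the fitted coefficients give the arrow direction and the R² its strength (envfit additionally runs a permutation test for significance). A rough base-R sketch of that idea with fabricated scores:

```r
# Fake ordination scores and a fake environmental variable, for illustration only
set.seed(42)
scores  <- matrix(rnorm(40), ncol = 2,
                  dimnames = list(NULL, c('NMDS1', 'NMDS2')))
env_var <- 2 * scores[, 1] + rnorm(20, sd = 0.5)   # varies mainly along axis 1

# Regress the variable on the two axes; the coefficients point along the gradient
fit <- lm(env_var ~ scores[, 1] + scores[, 2])
coef(fit)[2:3]            # arrow direction: dominated by the NMDS1 component
summary(fit)$r.squared    # analogous to the r2 column reported by envfit
```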
# Load the second dataset
data(varechem)
# The function envfit will add the environmental variables as vectors to the ordination plot
ef <- envfit(NMDS3, varechem, permu = 999)
ef
##
## ***VECTORS
##
## NMDS1 NMDS2 r2 Pr(>r)
## N 0.01870 -0.99983 0.1885 0.118
## P 0.66077 0.75059 0.2898 0.024 *
## K 0.85358 0.52096 0.1830 0.117
## Ca 0.83356 0.55243 0.3534 0.012 *
## Mg 0.84814 0.52977 0.2901 0.029 *
## S 0.41304 0.91071 0.0884 0.373
## Al -0.98935 0.14556 0.4888 0.003 **
## Fe -0.99350 0.11384 0.4081 0.003 **
## Mn 0.99233 0.12358 0.4293 0.006 **
## Zn 0.92740 0.37407 0.1650 0.152
## Mo -0.85989 -0.51047 0.0476 0.593
## Baresoil 0.91936 -0.39341 0.1877 0.103
## Humdepth 0.98855 0.15086 0.4415 0.001 ***
## pH -0.86063 0.50923 0.2252 0.077 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999
# The two last columns are of interest: the squared correlation coefficient and the associated p-value
# Plot the vectors of the significant correlations and interpret the plot
plot(NMDS3, type = 't', display = 'sites')
plot(ef, p.max = 0.05)
It's as easy as that. Next, let's say that we have two groups of samples. This could be the result of a classification or just two predefined groups (e.g. old versus young forests or two treatments). Now, we want to see the two groups on the ordination plot. Here is how you do it:
# Define a group variable (first 12 samples belong to group 1, last 12 samples to group 2)
group = c(rep('Group1', 12), rep('Group2', 12))
# Create a vector of color values with the same length as the vector of group values
colors = c(rep('red', 12), rep('blue', 12))
# Plot convex hulls with colors based on the group identity
ordiplot(NMDS3, type = 'n')
for(i in unique(group)){
ordihull(NMDS3$points[grep(i, group),], draw = 'polygon',
groups = group[group == i], col = colors[grep(i, group)], label = F)
}
orditorp(NMDS3, display = 'species', col = 'red', air = 0.01)
orditorp(NMDS3, display = 'sites', col = c(rep('red', 12),
rep('blue', 12)), air = 0.01, cex = 1.25)
Congratulations! You've made it to the end of the tutorial! Now you can put your new knowledge into practice with a couple of challenges.
Perform an ordination analysis on the dune dataset (use data(dune) to import) provided by the vegan package. Interpret your results using the environmental variables from dune.env.
Since no ordination method is specified, let’s try all of them.
# Load the data
data(dune)
data(dune.env)
cPCA <- rda(dune, scale = FALSE)
barplot(as.vector(cPCA$CA$eig)/sum(cPCA$CA$eig))
sum(as.vector(cPCA$CA$eig/sum(cPCA$CA$eig))[1:2])
## [1] 0.510462
plot(cPCA)
biplot(cPCA, choices = c(1,2), type = c('text', 'text'))
ef_cPCA <- envfit(cPCA, dune.env, permutations = 999)
plot(cPCA, type = 't', display = 'sites')
plot(ef_cPCA, p.max = 0.05)
# We start by calculating the distance matrix.
c_dist <- vegdist(dune, method = 'bray')
cPCOA <- pcoa(c_dist)
# Then inspect the eigenvalues to check whether negative eigenvalues are present. Negative eigenvalues are associated with non-Euclidean space.
print(as.vector(cPCOA$values$Relative_eig))
## [1] 0.3992224835 0.2378210860 0.1073416477 0.0889153796 0.0649297805
## [6] 0.0550429684 0.0393392674 0.0223876913 0.0173276057 0.0143548929
## [11] 0.0127797580 0.0044601518 0.0037494514 0.0009306563 0.0000000000
## [16] -0.0061467765 -0.0099690103 -0.0127317739 -0.0172418431 -0.0225134168
# There are negative values so a correction is needed. Let's just use cailliez as in the tutorial.
cPCOA <- pcoa(c_dist, correction = 'cailliez')
# See how much of the variance in the distance matrix is explained by the first two principal coordinates that we are going to use anyway.
cPCOA$values$Cum_corr_eig[1:2]
## [1] 0.3026888 0.4895187
sum(cPCOA$values$Rel_corr_eig[1:2])
## [1] 0.4895187
# Use barplots to visualize the relative and cumulative variance attributed to each principal coordinate.
barplot(as.vector(cPCOA$values$Rel_corr_eig))
barplot(as.vector(cPCOA$values$Cum_corr_eig))
# We proceed to plot the biplot with labels from the original dataset.
biplot.pcoa(cPCOA, dune)
# Let's now fit the environmental variables.
site_scores_pcoa <- cPCOA$vectors[, 1:2]
ef_cPCOA <- envfit(site_scores_pcoa, dune.env, permutations = 999)
# Show plots.
ordiplot(site_scores_pcoa, type = 'text', cex = .8)
## species scores not available
plot(ef_cPCOA, p.max = 0.05, col = 'blue', cex = 0.9)
Quick recap on the NMDS steps:
+ Step 1: Perform NMDS with 1 to 10 dimensions
+ Step 2: Check the stress vs dimension plot
+ Step 3: Choose the optimal number of dimensions
+ Step 4: Perform the final NMDS with that number of dimensions
+ Step 5: Check for a convergent solution and the final stress
# Steps 1 and 2:
NMDS.scree2 <- function (x) {
# Step 1: Perform NMDS with 1 to 10 dimensions.
a <- list()
for (i in 1:10) {
a[[i]] <- replicate(
10,
metaMDS(x, autotransform = F, k = i, trace = F)
)
}
# Step 2: Check the stress vs dimension plot.
ytemp = c()
for (i in 1:10){
ytemp[i] <- a[[1]][, i]$stress
}
plot(
x = rep(1, 10),
y = ytemp,
xlim = c(1, 10),
ylim = c(0, 0.3)
)
for (i in 2:10){
ytemp2 = c()
for (j in 1:10){
ytemp2[j] <- a[[i]][, j]$stress
}
points(
rep(i, 10),
ytemp2
)
}
return(a)
}
c_NMDS_result <- NMDS.scree2(c_dist)
# Step 3: Choose the optimal number of dimensions.
# In this case we can see that 2 dimensions give a stress between 0.1 and 0.2, which is usable according to the tutorial but will produce some misleading distances. We were going to use 2 dimensions anyway, so I am just acknowledging the limitations of choosing 2 dimensions.
# Step 4: Perform final NMDS with that number of dimensions. In this case, 2 dimensions.
c_NMDS <- metaMDS(dune, k = 2, trymax = 100, trace = F, autotransform = F, distance = 'bray')
# Step 5: Check for a convergent solution and final stress.
print(c_NMDS)
##
## Call:
## metaMDS(comm = dune, distance = "bray", k = 2, trymax = 100, autotransform = F, trace = F)
##
## global Multidimensional Scaling using monoMDS
##
## Data: dune
## Distance: bray
##
## Dimensions: 2
## Stress: 0.1183186
## Stress type 1, weak ties
## Best solution was repeated 1 time in 20 tries
## The best solution was from try 18 (random start)
## Scaling: centring, PC rotation, halfchange scaling
## Species: expanded scores based on 'dune'
# We can also perform a stressplot.
stressplot(c_NMDS)
# We can also produce a biplot.
plot(c_NMDS, type = 't', cex = 1.1)
# We can make it more elaborate. However, this version leaves out the site labels shown by the simple approach.
plot(c_NMDS, display = 'sites', type = 'n')
points(c_NMDS, display = 'sites', col = 'red', cex = 1.25)
text(c_NMDS, display = 'species')
# We can also use the ordiplot and orditorp functions.
ordiplot(c_NMDS, type = 'n')
orditorp(c_NMDS, display = 'species', cex = 0.8, col = 'red', air = 0.01)
orditorp(c_NMDS, display = 'sites', cex = 0.8, air = 0.01)
# Lastly, we can also fit environmental variables and visualize the fit as a biplot.
c_mnds_ef <- envfit(c_NMDS, dune.env, permutations = 999)
c_mnds_ef
##
## ***VECTORS
##
## NMDS1 NMDS2 r2 Pr(>r)
## A1 0.96473 0.26322 0.365 0.015 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999
##
## ***FACTORS:
##
## Centroids:
## NMDS1 NMDS2
## Moisture1 -0.5101 -0.0403
## Moisture2 -0.3938 0.0139
## Moisture4 0.2765 -0.4033
## Moisture5 0.6561 0.1476
## ManagementBF -0.4534 -0.0102
## ManagementHF -0.2636 -0.1282
## ManagementNM 0.2958 0.5790
## ManagementSF 0.1506 -0.4670
## UseHayfield -0.1568 0.3248
## UseHaypastu -0.0412 -0.3370
## UsePasture 0.2854 0.0844
## Manure0 0.2958 0.5790
## Manure1 -0.2482 -0.0215
## Manure2 -0.3079 -0.1866
## Manure3 0.3101 -0.2470
## Manure4 -0.3463 -0.5583
##
## Goodness of fit:
## r2 Pr(>r)
## Moisture 0.5014 0.001 ***
## Management 0.4134 0.003 **
## Use 0.1871 0.122
## Manure 0.4247 0.021 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999
# Straightforward plot with significant correlations.
plot(c_NMDS, type = 't', display = 'sites')
plot(c_mnds_ef, p.max = 0.05)
If you already know how to do a classification analysis, you can also perform a classification on the dune data. Then combine the ordination and classification results as we did above. Please have a look at our tutorial Intro to data clustering for more information on classification.
First, let's try the phylo tree approach. This approach involves clustering the data with recluster.cons to generate a phylo object. Then, one selects an appropriate level at which to cut the tree so that relevant clades are identified.
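The same cluster-then-cut idea can be sketched with base R's hclust() and cutree() (a simpler alternative to the recluster.cons workflow used below; toy data and Euclidean distance instead of Bray-Curtis):

```r
# UPGMA clustering on a toy community matrix, then cut into k groups
set.seed(321)
toy  <- matrix(runif(40), nrow = 8)            # 8 'sites' x 5 'species'
tree <- hclust(dist(toy), method = 'average')  # UPGMA = average linkage

groups <- cutree(tree, k = 3)                  # cut the dendrogram into 3 clusters
table(groups)                                  # number of sites per cluster
length(unique(groups))  # 3
```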
set.seed(123)
# Convert the dune dataframe into a matrix.
dune_matrix <- as.matrix(dune)
# Remove uniques.
dune_trim <- dune_matrix[, which(!colSums(dune_matrix) == 1)]
# Create phylo object.
dune_upgma <- recluster.cons(dune_trim, tr = 100, p = 0.5, dist = "(A+B-2*J)/(A+B)", method = 'average')$cons
# Identify polytomies.
dune_upgma_nodi <- di2multi(dune_upgma)
# Let's visualize what the graph looks like.
plot(dune_upgma_nodi)
plot(dune_upgma_nodi)
nodelabels()
plot(dune_upgma_nodi)
tiplabels()
# From the dendrogram, there are 4 main clades with node numbers: 32, 31, 28, and 22
clad_nodes <- c(32, 31, 28, 22) # <----- IMPORTANT: specify the clad nodes here.
dune_upgma_descendants <- purrr::map(clad_nodes, ~ c(getDescendants(dune_upgma_nodi, .x)))
dune_upgma_tips <- purrr::map(dune_upgma_descendants, ~
c(na.omit(dune_upgma_nodi$tip.label[.x])))
# Generate the colors corresponding to each group.
n_colors <- length(clad_nodes)
dune_colors <- colorspace::qualitative_hcl(n_colors)
# Name the list by their group.
dune_group_names <- purrr::map_chr(1:length(clad_nodes), ~ paste0('Group_', .x))
# Assign the names to the tips list.
dune_upgma_tips <- purrr::set_names(dune_upgma_tips, dune_group_names)
# Build a tibble for each element of the list containing site name and group color information.
dune_upgma_ref <- map2(dune_upgma_tips, dune_colors, ~tibble(Sites = .x, Color = .y))
ordiplot(c_NMDS, type = 'n')
purrr::imap(dune_upgma_ref, ~ ordihull(
c_NMDS$points[.x$Sites, ],
draw = 'polygon',
groups = rep(.y, nrow(c_NMDS$points[.x$Sites, ])),
col = .x$Color,
label = F
))
## $Group_1
## $Group_1
## MDS1 MDS2
## 7 -0.5402954 -0.05820807
## 5 -0.6265359 -0.08670667
## 6 -0.5426842 0.11315920
## 11 -0.3383126 0.35082130
## 18 -0.1771953 0.52341798
## 7 -0.5402954 -0.05820807
##
## attr(,"class")
## [1] "ordihull"
##
## $Group_2
## $Group_2
## MDS1 MDS2
## 17 -0.85996101 0.9870715
## 19 -0.07004293 1.0121452
## 17 -0.85996101 0.9870715
##
## attr(,"class")
## [1] "ordihull"
##
## $Group_3
## $Group_3
## MDS1 MDS2
## 16 1.0811009 -0.1796677
## 15 0.8960022 0.2223540
## 14 0.9433166 0.4760783
## 20 1.0424041 0.2526882
## 16 1.0811009 -0.1796677
##
## attr(,"class")
## [1] "ordihull"
##
## $Group_4
## $Group_4
## MDS1 MDS2
## 13 0.4186352 -0.5833357
## 4 -0.1156150 -0.5222423
## 2 -0.5048551 -0.4089409
## 8 0.2811576 -0.1668326
## 12 0.4424704 -0.3640875
## 13 0.4186352 -0.5833357
##
## attr(,"class")
## [1] "ordihull"
pts_tibble <- as_tibble(c_NMDS$points) %>%
rownames_to_column()
info_tibble <- bind_rows(dune_upgma_ref, .id = 'Group')
color_tibble <- pts_tibble %>%
left_join(info_tibble, by = join_by(rowname == Sites))
#orditorp (c_NMDS, display = 'species', cex = 0.8, col = 'black', air = 0.01)
orditorp(c_NMDS, display = 'sites', col = color_tibble %>% pull(Color), cex = 1.25, air = 0.01)
# We proceed to plot the biplot with labels from the original dataset.
biplot.pcoa(cPCOA, dune)
# Let's now fit the environmental variables. The site scores must be
# extracted before either envfit call can use them.
site_scores_pcoa <- cPCOA$vectors[, 1:2]
ef_cPCOA <- envfit(site_scores_pcoa, dune.env, permutations = 999)
ef_cPCOA2 <- envfit(site_scores_pcoa, dune, permutations = 999)
# Show plots.
ordiplot(site_scores_pcoa, type = 'text', cex = .8)
## species scores not available
plot(ef_cPCOA, p.max = 0.05, col = 'blue', cex = 0.9)
ordiplot(site_scores_pcoa, type = 'text', cex = .8)
## species scores not available
plot(ef_cPCOA2, p.max = 0.05, col = 'blue', cex = 0.9)
First, show the clusters, but this time on an ordination plot based on the PCoA:
ordiplot(site_scores_pcoa, type = 'n')
## species scores not available
purrr::imap(dune_upgma_ref, ~ ordihull(
site_scores_pcoa[.x$Sites, ],
draw = 'polygon',
groups = rep(.y, nrow(site_scores_pcoa[.x$Sites, ])),
col = .x$Color,
label = FALSE
))
## $Group_1
## $Group_1
## Axis.1 Axis.2
## 7 -0.3096374 -0.01907463
## 10 -0.3112373 0.05463911
## 18 -0.1333954 0.24105452
## 7 -0.3096374 -0.01907463
##
## attr(,"class")
## [1] "ordihull"
##
## $Group_2
## $Group_2
## Axis.1 Axis.2
## 19 -0.005394285 0.4691161
## 17 -0.194826335 0.4965459
## 19 -0.005394285 0.4691161
##
## attr(,"class")
## [1] "ordihull"
##
## $Group_3
## $Group_3
## Axis.1 Axis.2
## 16 0.5112138 -0.1108855
## 14 0.4301051 0.0861232
## 20 0.5091990 0.1575301
## 16 0.5112138 -0.1108855
##
## attr(,"class")
## [1] "ordihull"
##
## $Group_4
## $Group_4
## Axis.1 Axis.2
## 13 0.17987224 -0.2151706
## 3 -0.07276681 -0.2908709
## 2 -0.29462318 -0.1860944
## 12 0.19547602 -0.1259336
## 13 0.17987224 -0.2151706
##
## attr(,"class")
## [1] "ordihull"
pts_tibble <- as_tibble(site_scores_pcoa) %>%
rownames_to_column()
info_tibble <- bind_rows(dune_upgma_ref, .id = 'Group')
color_tibble <- pts_tibble %>%
left_join(info_tibble, by = join_by(rowname == Sites))
#orditorp (c_NMDS, display = 'species', cex = 0.8, col = 'black', air = 0.01)
orditorp(site_scores_pcoa, display = 'sites', col = color_tibble %>% pull(Color), cex = 1.25, air = 0.01)
Things got messy because “simpson” is not a distance metric
offered by vegdist, while “bray” is not one offered by
recluster.cons. So, to prevent mistakes arising from such
oversights in the future, the following attempts the
clustering starting from an already computed distance matrix based on
Bray-Curtis.
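To make that metric mix-up easier to spot, here is what Bray-Curtis dissimilarity actually computes, sketched in base R on a toy pair of sites (the vectors `x` and `y` are made-up abundances, not the dune data); `vegdist(rbind(x, y), 'bray')` should return the same value.

```r
# Bray-Curtis dissimilarity between two sites, computed by hand:
# d = sum(|x_i - y_i|) / sum(x_i + y_i)
x <- c(3, 0, 2)  # toy abundances for site A (made-up data)
y <- c(1, 4, 2)  # toy abundances for site B (made-up data)
bray <- sum(abs(x - y)) / sum(x + y)
bray  # 0.5
```
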
set.seed(123)
# Convert the dune dataframe into a matrix.
dune_matrix <- as.matrix(dune)
# Remove singleton species (columns whose total abundance sums to 1).
dune_trim <- dune_matrix[, colSums(dune_matrix) != 1]
# Obtain the distance matrix:
dune_dist <- vegdist(dune_trim, 'bray')
# Create phylo object.
dune_upgma2 <- recluster.cons(dune_dist, tr = 100, p = 0.5, method = 'average')$cons
# Collapse zero-length branches into polytomies.
dune_upgma2_nodi <- di2multi(dune_upgma2)
# Let's visualize what the graph looks like.
plot(dune_upgma2_nodi)
plot(dune_upgma2_nodi)
nodelabels()
plot(dune_upgma2_nodi)
tiplabels()
# From the dendrogram, there are 4 main clades; specify their node numbers below.
clad_nodes <- c(32, 31, 28, 22) # <----- IMPORTANT: specify the clade nodes here.
dune_upgma_descendants <- purrr::map(clad_nodes, ~ c(getDescendants(dune_upgma2_nodi, .x)))
dune_upgma_tips <- purrr::map(dune_upgma_descendants, ~
c(na.omit(dune_upgma2_nodi$tip.label[.x])))
# Generate the colors corresponding to each group.
n_colors <- length(clad_nodes)
dune_colors <- colorspace::qualitative_hcl(n_colors)
# Name the list by their group.
dune_group_names <- purrr::map_chr(seq_along(clad_nodes), ~ paste0('Group_', .x))
# Assign the names to the tips list.
dune_upgma_tips <- purrr::set_names(dune_upgma_tips, dune_group_names)
# Build a tibble for each element of the list containing site name and group color information.
dune_upgma_ref <- map2(dune_upgma_tips, dune_colors, ~tibble(Sites = .x, Color = .y))
ordiplot(site_scores_pcoa, type = 'n')
## species scores not available
purrr::imap(dune_upgma_ref, ~ ordihull(
site_scores_pcoa[.x$Sites, ],
draw = 'polygon',
groups = rep(.y, nrow(site_scores_pcoa[.x$Sites, ])),
col = .x$Color,
label = FALSE
))
## $Group_1
## $Group_1
## Axis.1 Axis.2
## 3 -0.07276681 -0.2908709
## 4 -0.06925423 -0.2641976
## 12 0.19547602 -0.1259336
## 13 0.17987224 -0.2151706
## 3 -0.07276681 -0.2908709
##
## attr(,"class")
## [1] "ordihull"
##
## $Group_2
## $Group_2
## Axis.1 Axis.2
## 19 -0.005394285 0.4691161
## 17 -0.194826335 0.4965459
## 19 -0.005394285 0.4691161
##
## attr(,"class")
## [1] "ordihull"
##
## $Group_3
## $Group_3
## Axis.1 Axis.2
## 16 0.5112138 -0.1108855
## 14 0.4301051 0.0861232
## 20 0.5091990 0.1575301
## 16 0.5112138 -0.1108855
##
## attr(,"class")
## [1] "ordihull"
##
## $Group_4
## $Group_4
## Axis.1 Axis.2
## 2 -0.2946232 -0.18609437
## 7 -0.3096374 -0.01907463
## 10 -0.3112373 0.05463911
## 18 -0.1333954 0.24105452
## 2 -0.2946232 -0.18609437
##
## attr(,"class")
## [1] "ordihull"
pts_tibble <- as_tibble(site_scores_pcoa) %>%
rownames_to_column()
info_tibble <- bind_rows(dune_upgma_ref, .id = 'Group')
color_tibble <- pts_tibble %>%
left_join(info_tibble, by = join_by(rowname == Sites))
#orditorp (c_NMDS, display = 'species', cex = 0.8, col = 'black', air = 0.01)
orditorp(site_scores_pcoa, display = 'sites', col = color_tibble %>% pull(Color), cex = 1.25, air = 0.01)
cPCA <- rda(dune, scale = FALSE)
plot(cPCA)
biplot(cPCA, choices = c(1,2), type = c('text', 'text'))
ef_cPCA2 <- envfit(cPCA, dune, permutations = 999)
plot(cPCA, type = 't', display = 'sites')
plot(ef_cPCA2, p.max = 0.05)
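As a sanity check on what `rda` with `scale = FALSE` is doing, the same covariance-based PCA can be sketched with base R's `prcomp` on toy data (the `toy` matrix below is made up, not the dune data); the proportions of variance per axis should match the eigenvalue proportions vegan reports via `summary(rda(toy))`.

```r
set.seed(1)
toy <- matrix(rpois(60, lambda = 3), nrow = 10)  # 10 toy sites x 6 toy species
pca <- prcomp(toy, center = TRUE, scale. = FALSE)  # covariance-based PCA
var_explained <- pca$sdev^2 / sum(pca$sdev^2)      # proportion of variance per axis
var_explained
```
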